The purpose of this project is to use regression to predict the number of shares of articles published by Mashable.
Dataset source: http://archive.ics.uci.edu/ml/datasets/Online+News+Popularity
Abstract: This dataset summarizes a heterogeneous set of features about articles published by Mashable over a period of two years. The goal is to predict the number of shares in social networks (popularity).
Data Set Characteristics: Multivariate
Number of Instances: 39797
Area: Business
Attribute Characteristics: Integer, Real
Number of Attributes: 61 (58 predictive attributes, 2 non-predictive, 1 goal field)
Date Donated: 2015-05-31
Attribute Information:
0. url: URL of the article (non-predictive; not included in the analysis)
1. timedelta: Days between the article publication and the dataset acquisition (non-predictive; not included in the analysis)
2. n_tokens_title: Number of words in the title
3. n_tokens_content: Number of words in the content
4. n_unique_tokens: Rate of unique words in the content
5. n_non_stop_words: Rate of non-stop words in the content
Examples of stop words:
Determiners, which mark nouns (a determiner is usually followed by a noun): the, a, an, another
Coordinating conjunctions, which connect words, phrases, and clauses: for, and, nor, but, or, yet, so
Prepositions, which express temporal or spatial relations: in, under, towards, before
n_non_stop_unique_tokens: Rate of unique non-stop words in the content
num_hrefs: Number of links
num_self_hrefs: Number of links to other articles published by Mashable
num_imgs: Number of images
num_videos: Number of videos
average_token_length: Average length of the words in the content
num_keywords: Number of keywords in the metadata
data_channel_is_world: Is data channel ‘World’?
kw_avg_avg: Avg. keyword (avg. shares)
self_reference_avg_sharess: Avg. shares of referenced articles in Mashable
is_weekend: Was the article published on the weekend?
LDA_04: Closeness to LDA topic 4
rate_negative_words: Rate of negative words among non-neutral tokens
max_negative_polarity: Max. polarity of negative words
abs_title_sentiment_polarity: Absolute polarity level
shares: Number of shares (target)
We have based some of our analysis on the paper “Predicting and Evaluating the Popularity of Online News” that can be found here: http://cs229.stanford.edu/proj2015/328_report.pdf
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(GGally)
##
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
##
## nasa
data = read.csv("OnlineNewsPopularity.csv")
str(data)
## 'data.frame': 39644 obs. of 61 variables:
## $ url : Factor w/ 39644 levels "http://mashable.com/2013/01/07/amazon-instant-video-browser/",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ timedelta : num 731 731 731 731 731 731 731 731 731 731 ...
## $ n_tokens_title : num 12 9 9 9 13 10 8 12 11 10 ...
## $ n_tokens_content : num 219 255 211 531 1072 ...
## $ n_unique_tokens : num 0.664 0.605 0.575 0.504 0.416 ...
## $ n_non_stop_words : num 1 1 1 1 1 ...
## $ n_non_stop_unique_tokens : num 0.815 0.792 0.664 0.666 0.541 ...
## $ num_hrefs : num 4 3 3 9 19 2 21 20 2 4 ...
## $ num_self_hrefs : num 2 1 1 0 19 2 20 20 0 1 ...
## $ num_imgs : num 1 1 1 1 20 0 20 20 0 1 ...
## $ num_videos : num 0 0 0 0 0 0 0 0 0 1 ...
## $ average_token_length : num 4.68 4.91 4.39 4.4 4.68 ...
## $ num_keywords : num 5 4 6 7 7 9 10 9 7 5 ...
## $ data_channel_is_lifestyle : num 0 0 0 0 0 0 1 0 0 0 ...
## $ data_channel_is_entertainment: num 1 0 0 1 0 0 0 0 0 0 ...
## $ data_channel_is_bus : num 0 1 1 0 0 0 0 0 0 0 ...
## $ data_channel_is_socmed : num 0 0 0 0 0 0 0 0 0 0 ...
## $ data_channel_is_tech : num 0 0 0 0 1 1 0 1 1 0 ...
## $ data_channel_is_world : num 0 0 0 0 0 0 0 0 0 1 ...
## $ kw_min_min : num 0 0 0 0 0 0 0 0 0 0 ...
## $ kw_max_min : num 0 0 0 0 0 0 0 0 0 0 ...
## $ kw_avg_min : num 0 0 0 0 0 0 0 0 0 0 ...
## $ kw_min_max : num 0 0 0 0 0 0 0 0 0 0 ...
## $ kw_max_max : num 0 0 0 0 0 0 0 0 0 0 ...
## $ kw_avg_max : num 0 0 0 0 0 0 0 0 0 0 ...
## $ kw_min_avg : num 0 0 0 0 0 0 0 0 0 0 ...
## $ kw_max_avg : num 0 0 0 0 0 0 0 0 0 0 ...
## $ kw_avg_avg : num 0 0 0 0 0 0 0 0 0 0 ...
## $ self_reference_min_shares : num 496 0 918 0 545 8500 545 545 0 0 ...
## $ self_reference_max_shares : num 496 0 918 0 16000 8500 16000 16000 0 0 ...
## $ self_reference_avg_sharess : num 496 0 918 0 3151 ...
## $ weekday_is_monday : num 1 1 1 1 1 1 1 1 1 1 ...
## $ weekday_is_tuesday : num 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday_is_wednesday : num 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday_is_thursday : num 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday_is_friday : num 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday_is_saturday : num 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday_is_sunday : num 0 0 0 0 0 0 0 0 0 0 ...
## $ is_weekend : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LDA_00 : num 0.5003 0.7998 0.2178 0.0286 0.0286 ...
## $ LDA_01 : num 0.3783 0.05 0.0333 0.4193 0.0288 ...
## $ LDA_02 : num 0.04 0.0501 0.0334 0.4947 0.0286 ...
## $ LDA_03 : num 0.0413 0.0501 0.0333 0.0289 0.0286 ...
## $ LDA_04 : num 0.0401 0.05 0.6822 0.0286 0.8854 ...
## $ global_subjectivity : num 0.522 0.341 0.702 0.43 0.514 ...
## $ global_sentiment_polarity : num 0.0926 0.1489 0.3233 0.1007 0.281 ...
## $ global_rate_positive_words : num 0.0457 0.0431 0.0569 0.0414 0.0746 ...
## $ global_rate_negative_words : num 0.0137 0.01569 0.00948 0.02072 0.01213 ...
## $ rate_positive_words : num 0.769 0.733 0.857 0.667 0.86 ...
## $ rate_negative_words : num 0.231 0.267 0.143 0.333 0.14 ...
## $ avg_positive_polarity : num 0.379 0.287 0.496 0.386 0.411 ...
## $ min_positive_polarity : num 0.1 0.0333 0.1 0.1364 0.0333 ...
## $ max_positive_polarity : num 0.7 0.7 1 0.8 1 0.6 1 1 0.8 0.5 ...
## $ avg_negative_polarity : num -0.35 -0.119 -0.467 -0.37 -0.22 ...
## $ min_negative_polarity : num -0.6 -0.125 -0.8 -0.6 -0.5 -0.4 -0.5 -0.5 -0.125 -0.5 ...
## $ max_negative_polarity : num -0.2 -0.1 -0.133 -0.167 -0.05 ...
## $ title_subjectivity : num 0.5 0 0 0 0.455 ...
## $ title_sentiment_polarity : num -0.188 0 0 0 0.136 ...
## $ abs_title_subjectivity : num 0 0.5 0.5 0.5 0.0455 ...
## $ abs_title_sentiment_polarity : num 0.188 0 0 0 0.136 ...
## $ shares : int 593 711 1500 1200 505 855 556 891 3600 710 ...
set.seed(27)
#Create a sample of the data to reduce the time of processing it
dsmall <- sample_n(data, 10000)
First, we generate a global summary of the dataset to understand how the values are distributed.
summary(dsmall)
## url
## http://mashable.com/2013/01/07/amazon-instant-video-browser/ : 1
## http://mashable.com/2013/01/07/apple-40-billion-app-downloads/: 1
## http://mashable.com/2013/01/07/att-u-verse-apps/ : 1
## http://mashable.com/2013/01/07/bodymedia-armbandgets-update/ : 1
## http://mashable.com/2013/01/07/cosmic-events-doomsday/ : 1
## http://mashable.com/2013/01/07/earth-size-planets-milky-way/ : 1
## (Other) :9994
## timedelta n_tokens_title n_tokens_content n_unique_tokens
## Min. : 8.0 Min. : 2.0 Min. : 0.0 Min. :0.0000
## 1st Qu.:164.0 1st Qu.: 9.0 1st Qu.: 245.0 1st Qu.:0.4707
## Median :338.0 Median :10.0 Median : 408.0 Median :0.5410
## Mean :355.3 Mean :10.4 Mean : 541.7 Mean :0.5320
## 3rd Qu.:543.0 3rd Qu.:12.0 3rd Qu.: 709.2 3rd Qu.:0.6104
## Max. :731.0 Max. :19.0 Max. :7185.0 Max. :1.0000
##
## n_non_stop_words n_non_stop_unique_tokens num_hrefs
## Min. :0.0000 Min. :0.0000 Min. : 0.00
## 1st Qu.:1.0000 1st Qu.:0.6261 1st Qu.: 4.00
## Median :1.0000 Median :0.6910 Median : 7.00
## Mean :0.9713 Mean :0.6741 Mean : 10.83
## 3rd Qu.:1.0000 3rd Qu.:0.7557 3rd Qu.: 14.00
## Max. :1.0000 Max. :1.0000 Max. :145.00
##
## num_self_hrefs num_imgs num_videos average_token_length
## Min. : 0.00 Min. : 0.000 Min. : 0.000 Min. :0.000
## 1st Qu.: 1.00 1st Qu.: 1.000 1st Qu.: 0.000 1st Qu.:4.479
## Median : 3.00 Median : 1.000 Median : 0.000 Median :4.666
## Mean : 3.31 Mean : 4.494 Mean : 1.236 Mean :4.552
## 3rd Qu.: 4.00 3rd Qu.: 4.000 3rd Qu.: 1.000 3rd Qu.:4.855
## Max. :116.00 Max. :111.000 Max. :91.000 Max. :7.696
##
## num_keywords data_channel_is_lifestyle data_channel_is_entertainment
## Min. : 1.000 Min. :0.0000 Min. :0.0000
## 1st Qu.: 6.000 1st Qu.:0.0000 1st Qu.:0.0000
## Median : 7.000 Median :0.0000 Median :0.0000
## Mean : 7.223 Mean :0.0547 Mean :0.1788
## 3rd Qu.: 9.000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :10.000 Max. :1.0000 Max. :1.0000
##
## data_channel_is_bus data_channel_is_socmed data_channel_is_tech
## Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.1543 Mean :0.0594 Mean :0.1859
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000
##
## data_channel_is_world kw_min_min kw_max_min kw_avg_min
## Min. :0.0000 Min. : -1.00 Min. : 0 Min. : -1.0
## 1st Qu.:0.0000 1st Qu.: -1.00 1st Qu.: 444 1st Qu.: 138.9
## Median :0.0000 Median : -1.00 Median : 653 Median : 233.3
## Mean :0.2137 Mean : 26.43 Mean : 1183 Mean : 320.2
## 3rd Qu.:0.0000 3rd Qu.: 4.00 3rd Qu.: 1000 3rd Qu.: 352.6
## Max. :1.0000 Max. :377.00 Max. :138700 Max. :34855.1
##
## kw_min_max kw_max_max kw_avg_max kw_min_avg
## Min. : 0 Min. : 0 Min. : 0 Min. : -1
## 1st Qu.: 0 1st Qu.:843300 1st Qu.:171300 1st Qu.: 0
## Median : 1400 Median :843300 Median :245263 Median :1022
## Mean : 14201 Mean :751491 Mean :259888 Mean :1112
## 3rd Qu.: 7825 3rd Qu.:843300 3rd Qu.:332258 3rd Qu.:2051
## Max. :843300 Max. :843300 Max. :843300 Max. :3607
##
## kw_max_avg kw_avg_avg self_reference_min_shares
## Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 3564 1st Qu.: 2372 1st Qu.: 629
## Median : 4332 Median : 2871 Median : 1200
## Mean : 5755 Mean : 3148 Mean : 4027
## 3rd Qu.: 6016 3rd Qu.: 3591 3rd Qu.: 2600
## Max. :237967 Max. :37608 Max. :690400
##
## self_reference_max_shares self_reference_avg_sharess weekday_is_monday
## Min. : 0 Min. : 0.0 Min. :0.0000
## 1st Qu.: 1100 1st Qu.: 979.6 1st Qu.:0.0000
## Median : 2800 Median : 2200.0 Median :0.0000
## Mean : 10771 Mean : 6618.6 Mean :0.1692
## 3rd Qu.: 7925 3rd Qu.: 5200.0 3rd Qu.:0.0000
## Max. :843300 Max. :690400.0 Max. :1.0000
##
## weekday_is_tuesday weekday_is_wednesday weekday_is_thursday
## Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.1894 Mean :0.1885 Mean :0.1805
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000
##
## weekday_is_friday weekday_is_saturday weekday_is_sunday is_weekend
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.1402 Mean :0.0624 Mean :0.0698 Mean :0.1322
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
##
## LDA_00 LDA_01 LDA_02 LDA_03
## Min. :0.01818 Min. :0.01819 Min. :0.01818 Min. :0.01818
## 1st Qu.:0.02505 1st Qu.:0.02501 1st Qu.:0.02857 1st Qu.:0.02593
## Median :0.03339 Median :0.03335 Median :0.04001 Median :0.04000
## Mean :0.18271 Mean :0.14098 Mean :0.21778 Mean :0.22391
## 3rd Qu.:0.23935 3rd Qu.:0.15017 3rd Qu.:0.33654 3rd Qu.:0.36705
## Max. :0.92000 Max. :0.92595 Max. :0.92000 Max. :0.92554
##
## LDA_04 global_subjectivity global_sentiment_polarity
## Min. :0.01818 Min. :0.0000 Min. :-0.37766
## 1st Qu.:0.02857 1st Qu.:0.3942 1st Qu.: 0.05604
## Median :0.04282 Median :0.4538 Median : 0.11861
## Mean :0.23462 Mean :0.4438 Mean : 0.11873
## 3rd Qu.:0.39910 3rd Qu.:0.5098 3rd Qu.: 0.17797
## Max. :0.91999 Max. :1.0000 Max. : 0.62500
##
## global_rate_positive_words global_rate_negative_words rate_positive_words
## Min. :0.00000 Min. :0.000000 Min. :0.0000
## 1st Qu.:0.02820 1st Qu.:0.009615 1st Qu.:0.6000
## Median :0.03891 Median :0.015409 Median :0.7074
## Mean :0.03952 Mean :0.016647 Mean :0.6820
## 3rd Qu.:0.05021 3rd Qu.:0.021739 3rd Qu.:0.8000
## Max. :0.15217 Max. :0.184932 Max. :1.0000
##
## rate_negative_words avg_positive_polarity min_positive_polarity
## Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.1852 1st Qu.:0.3042 1st Qu.:0.05000
## Median :0.2857 Median :0.3583 Median :0.10000
## Mean :0.2891 Mean :0.3538 Mean :0.09686
## 3rd Qu.:0.3846 3rd Qu.:0.4132 3rd Qu.:0.10000
## Max. :1.0000 Max. :1.0000 Max. :1.00000
##
## max_positive_polarity avg_negative_polarity min_negative_polarity
## Min. :0.0000 Min. :-1.0000 Min. :-1.0000
## 1st Qu.:0.6000 1st Qu.:-0.3304 1st Qu.:-0.7000
## Median :0.8000 Median :-0.2542 Median :-0.5000
## Mean :0.7535 Mean :-0.2614 Mean :-0.5225
## 3rd Qu.:1.0000 3rd Qu.:-0.1883 3rd Qu.:-0.3000
## Max. :1.0000 Max. : 0.0000 Max. : 0.0000
##
## max_negative_polarity title_subjectivity title_sentiment_polarity
## Min. :-1.0000 Min. :0.0000 Min. :-1.00000
## 1st Qu.:-0.1250 1st Qu.:0.0000 1st Qu.: 0.00000
## Median :-0.1000 Median :0.1250 Median : 0.00000
## Mean :-0.1099 Mean :0.2782 Mean : 0.06865
## 3rd Qu.:-0.0500 3rd Qu.:0.5000 3rd Qu.: 0.13636
## Max. : 0.0000 Max. :1.0000 Max. : 1.00000
##
## abs_title_subjectivity abs_title_sentiment_polarity shares
## Min. :0.0000 Min. :0.0000 Min. : 5
## 1st Qu.:0.1667 1st Qu.:0.0000 1st Qu.: 945
## Median :0.5000 Median :0.0000 Median : 1400
## Mean :0.3439 Mean :0.1554 Mean : 3211
## 3rd Qu.:0.5000 3rd Qu.:0.2500 3rd Qu.: 2700
## Max. :0.5000 Max. :1.0000 Max. :652900
##
The main finding of this preliminary analysis is the large gap between the 3rd quartile of the number of shares (2,700) and its maximum (652,900). Such extreme values will most likely degrade the quality of the predictions we make.
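To quantify this skew before modeling, one option (a sketch, not part of the original analysis) is to inspect the upper quantiles of shares and view the target on a log scale:

```r
# Sketch: quantify the right skew of the target before modeling.
# Assumes `dsmall` has been created as above.
quantile(dsmall$shares, probs = c(0.5, 0.75, 0.9, 0.99, 1))

# A log-scale histogram makes heavy-tailed counts easier to read;
# modeling log(shares) is a common remedy for extreme values like these.
hist(log1p(dsmall$shares), main = "log(1 + shares)")
```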
As a second step of the exploratory analysis, we generated histograms to understand how the different variables are distributed. We skipped the binary variables, since histograms add little to understanding their distribution.
hist(dsmall$n_tokens_title)
hist(dsmall$n_tokens_content)
hist(dsmall$n_unique_tokens)
hist(dsmall$n_non_stop_words)
hist(dsmall$n_non_stop_unique_tokens)
hist(dsmall$num_hrefs)
hist(dsmall$num_self_hrefs)
hist(dsmall$num_imgs)
hist(dsmall$num_videos)
hist(dsmall$average_token_length)
hist(dsmall$num_keywords)
hist(dsmall$kw_min_min)
hist(dsmall$kw_max_min)
hist(dsmall$kw_avg_min)
hist(dsmall$kw_min_max)
hist(dsmall$kw_max_max)
hist(dsmall$kw_avg_max)
hist(dsmall$kw_min_avg)
hist(dsmall$kw_max_avg)
hist(dsmall$kw_avg_avg)
hist(dsmall$self_reference_min_shares)
hist(dsmall$self_reference_max_shares)
hist(dsmall$self_reference_avg_sharess)
hist(dsmall$LDA_00)
hist(dsmall$LDA_01)
hist(dsmall$LDA_02)
hist(dsmall$LDA_03)
hist(dsmall$LDA_04)
hist(dsmall$global_subjectivity)
hist(dsmall$global_sentiment_polarity)
hist(dsmall$global_rate_positive_words)
hist(dsmall$global_rate_negative_words)
hist(dsmall$rate_positive_words)
hist(dsmall$rate_negative_words)
hist(dsmall$avg_positive_polarity)
hist(dsmall$min_positive_polarity)
hist(dsmall$max_positive_polarity)
hist(dsmall$avg_negative_polarity)
hist(dsmall$min_negative_polarity)
hist(dsmall$max_negative_polarity)
hist(dsmall$title_subjectivity)
hist(dsmall$title_sentiment_polarity)
hist(dsmall$abs_title_subjectivity)
hist(dsmall$abs_title_sentiment_polarity)
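The repeated hist() calls above can be written more compactly as a loop over column names. This sketch assumes the same dsmall and skips the binary indicator columns by name:

```r
# Sketch: draw the same histograms with one loop instead of 44 calls.
# Binary indicators (data_channel_*, weekday_*, is_weekend) are skipped,
# as are the non-predictive url and timedelta columns.
binary_cols <- grep("^(data_channel_is_|weekday_is_)|^is_weekend$",
                    names(dsmall), value = TRUE)
num_cols <- setdiff(names(dsmall), c("url", "timedelta", "shares", binary_cols))
for (v in num_cols) hist(dsmall[[v]], main = v, xlab = v)
```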
qplot(shares, data=dsmall, geom="histogram",xlim=c(0,40000),ylim=c(0,1500))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 63 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_bar).
As a main conclusion, the histograms confirm that the distribution of shares is heavily right-skewed: most articles receive a low number of shares, while a very small number receive extremely many.
Since we have a large number of variables, we will start by grouping them into categories and performing a regression analysis on each block. This reduces the computational cost of fitting a model with very many variables and also helps us better understand the types of variables in the dataset.
We will pick the best variables of each block and carry them into the final pool of variables for a global model.
Example: a Words category that includes all word-related variables: tokens in the title, tokens in the content, non-stop words, etc.
Here we create a data frame for each category with its variables:
words = dsmall %>%
  select(n_tokens_title, n_tokens_content, n_unique_tokens, n_non_stop_words,
         n_non_stop_unique_tokens, average_token_length, shares)
links = dsmall %>%
  select(num_hrefs, num_self_hrefs, self_reference_min_shares,
         self_reference_max_shares, self_reference_avg_sharess, shares)
media = dsmall %>%
  select(num_imgs, num_videos, shares)
time = dsmall %>%
  select(weekday_is_monday, weekday_is_tuesday, weekday_is_wednesday,
         weekday_is_thursday, weekday_is_friday, weekday_is_saturday,
         weekday_is_sunday, is_weekend, shares)
keywords = dsmall %>%
  select(num_keywords, data_channel_is_lifestyle, data_channel_is_entertainment,
         data_channel_is_bus, data_channel_is_socmed, data_channel_is_tech,
         data_channel_is_world, kw_min_min, kw_max_min, kw_avg_min, kw_min_max,
         kw_max_max, kw_avg_max, kw_min_avg, kw_max_avg, kw_avg_avg, shares)
nlp = dsmall %>%
  select(LDA_00, LDA_01, LDA_02, LDA_03, LDA_04, global_subjectivity,
         global_sentiment_polarity, global_rate_positive_words,
         global_rate_negative_words, rate_positive_words, rate_negative_words,
         avg_positive_polarity, min_positive_polarity, max_positive_polarity,
         avg_negative_polarity, min_negative_polarity, max_negative_polarity,
         title_subjectivity, title_sentiment_polarity, abs_title_subjectivity,
         shares)
We now have six categories: Words, Links, Media, Time, Keywords, and NLP (for natural language processing).
We will now run the regression on each of the categories to determine the most significant variables and discard the least significant ones.
We will use a maximum tolerable p-value of 5% to select the variables to keep for building the global model.
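This drop-and-refit rule, applied manually to each block below, could be sketched as a small helper function (a sketch, not part of the original analysis; it does a single pass, whereas a full backward elimination would iterate):

```r
# Sketch of the selection rule used below: fit on all predictors, read
# p-values from the coefficient table, and refit on those with p < 0.05.
# Assumes each block data frame contains a `shares` column, as above.
keep_significant <- function(df, alpha = 0.05) {
  fit  <- lm(shares ~ ., data = df)
  p    <- summary(fit)$coefficients[, "Pr(>|t|)"]
  keep <- setdiff(names(p)[!is.na(p) & p < alpha], "(Intercept)")
  lm(reformulate(keep, response = "shares"), data = df)
}
# Example usage: summary(keep_significant(words))
```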
m1 = lm(data = words, shares ~ .)
summary(m1)
##
## Call:
## lm(formula = shares ~ ., data = words)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5445 -2238 -1533 -324 648776
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.731e+03 7.691e+02 6.151 7.99e-10 ***
## n_tokens_title 1.779e+01 4.633e+01 0.384 0.7010
## n_tokens_content 7.600e-01 3.265e-01 2.327 0.0200 *
## n_unique_tokens 1.585e+04 2.637e+03 6.010 1.92e-09 ***
## n_non_stop_words 4.116e+03 2.137e+03 1.926 0.0541 .
## n_non_stop_unique_tokens -1.159e+04 2.237e+03 -5.181 2.25e-07 ***
## average_token_length -1.479e+03 3.661e+02 -4.041 5.36e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9703 on 9993 degrees of freedom
## Multiple R-squared: 0.005427, Adjusted R-squared: 0.00483
## F-statistic: 9.087 on 6 and 9993 DF, p-value: 6.162e-10
#Remove the variables with a p-value > 0.05 and re-run the model
summary(lm(data = words, shares ~ n_unique_tokens + n_non_stop_unique_tokens + average_token_length + n_tokens_content))
##
## Call:
## lm(formula = shares ~ n_unique_tokens + n_non_stop_unique_tokens +
## average_token_length + n_tokens_content, data = words)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6104 -2218 -1559 -345 648999
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5227.9090 551.7818 9.475 < 2e-16 ***
## n_unique_tokens 15420.7248 2628.0951 5.868 4.56e-09 ***
## n_non_stop_unique_tokens -9929.6075 2068.4728 -4.800 1.61e-06 ***
## average_token_length -888.5832 193.7034 -4.587 4.54e-06 ***
## n_tokens_content 0.9554 0.3109 3.073 0.00212 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9704 on 9995 degrees of freedom
## Multiple R-squared: 0.005033, Adjusted R-squared: 0.004635
## F-statistic: 12.64 on 4 and 9995 DF, p-value: 2.921e-10
m2 = lm(data = links, shares ~ .)
summary(m2)
##
## Call:
## lm(formula = shares ~ ., data = links)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45533 -2081 -1600 -387 630670
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.659e+03 1.439e+02 18.477 < 2e-16 ***
## num_hrefs 4.059e+01 9.652e+00 4.205 2.63e-05 ***
## num_self_hrefs -5.487e+01 2.644e+01 -2.075 0.038 *
## self_reference_min_shares 5.512e-02 1.025e-02 5.377 7.75e-08 ***
## self_reference_max_shares 4.430e-04 5.687e-03 0.078 0.938
## self_reference_avg_sharess 1.007e-02 1.387e-02 0.726 0.468
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9628 on 9994 degrees of freedom
## Multiple R-squared: 0.02074, Adjusted R-squared: 0.02025
## F-statistic: 42.34 on 5 and 9994 DF, p-value: < 2.2e-16
#Again, remove the variables with p-value > 0.05 and re-run the model
summary(lm(data = links, shares ~ num_hrefs + num_self_hrefs +self_reference_min_shares))
##
## Call:
## lm(formula = shares ~ num_hrefs + num_self_hrefs + self_reference_min_shares,
## data = links)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45564 -2091 -1607 -387 630648
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.675e+03 1.428e+02 18.728 < 2e-16 ***
## num_hrefs 4.083e+01 9.639e+00 4.236 2.29e-05 ***
## num_self_hrefs -5.160e+01 2.595e+01 -1.988 0.0468 *
## self_reference_min_shares 6.564e-02 4.781e-03 13.731 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9628 on 9996 degrees of freedom
## Multiple R-squared: 0.02039, Adjusted R-squared: 0.0201
## F-statistic: 69.37 on 3 and 9996 DF, p-value: < 2.2e-16
m3 = lm(data = media, shares ~ .)
summary(m3)
##
## Call:
## lm(formula = shares ~ ., data = media)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5929 -2199 -1712 -449 649826
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2970.84 115.42 25.738 < 2e-16 ***
## num_imgs 39.22 11.70 3.352 0.000804 ***
## num_videos 51.55 23.97 2.150 0.031565 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9720 on 9997 degrees of freedom
## Multiple R-squared: 0.001485, Adjusted R-squared: 0.001285
## F-statistic: 7.433 on 2 and 9997 DF, p-value: 0.0005947
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
m4 = lm(data = time, shares ~ .)
summary(m4)
##
## Call:
## lm(formula = shares ~ ., data = time)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3760 -2230 -1758 -495 649362
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3706.6 368.1 10.069 <2e-16 ***
## weekday_is_monday -168.4 437.5 -0.385 0.7003
## weekday_is_tuesday -711.3 430.6 -1.652 0.0986 .
## weekday_is_wednesday -738.5 430.9 -1.714 0.0866 .
## weekday_is_thursday -682.0 433.5 -1.573 0.1157
## weekday_is_friday -549.0 450.5 -1.219 0.2230
## weekday_is_saturday 107.1 535.8 0.200 0.8415
## weekday_is_sunday NA NA NA NA
## is_weekend NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9725 on 9993 degrees of freedom
## Multiple R-squared: 0.0008936, Adjusted R-squared: 0.0002937
## F-statistic: 1.49 on 6 and 9993 DF, p-value: 0.1772
#Since many of the variables are not significant, we run the step-wise function (stepAIC) to identify the most significant ones.
null= lm(data=time, shares ~ 1)
full = lm(data=time, shares ~ .)
step = stepAIC(null, scope=list(lower=null, upper=full), direction = "forward")
## Start: AIC=183653.2
## shares ~ 1
##
## Df Sum of Sq RSS AIC
## + is_weekend 1 454744773 9.4550e+11 183650
## + weekday_is_saturday 1 241926056 9.4571e+11 183653
## + weekday_is_monday 1 218323983 9.4574e+11 183653
## <none> 9.4596e+11 183653
## + weekday_is_sunday 1 184453042 9.4577e+11 183653
## + weekday_is_wednesday 1 136829546 9.4582e+11 183654
## + weekday_is_tuesday 1 108469518 9.4585e+11 183654
## + weekday_is_thursday 1 76359028 9.4588e+11 183654
## + weekday_is_friday 1 4618178 9.4595e+11 183655
##
## Step: AIC=183650.4
## shares ~ is_weekend
##
## Df Sum of Sq RSS AIC
## + weekday_is_monday 1 354428670 9.4515e+11 183649
## <none> 9.4550e+11 183650
## + weekday_is_wednesday 1 61241203 9.4544e+11 183652
## + weekday_is_tuesday 1 42360690 9.4546e+11 183652
## + weekday_is_thursday 1 24160547 9.4548e+11 183652
## + weekday_is_saturday 1 3780575 9.4550e+11 183652
## + weekday_is_sunday 1 3780575 9.4550e+11 183652
## + weekday_is_friday 1 1506309 9.4550e+11 183652
##
## Step: AIC=183648.6
## shares ~ is_weekend + weekday_is_monday
##
## Df Sum of Sq RSS AIC
## <none> 9.4515e+11 183649
## + weekday_is_friday 1 29402354 9.4512e+11 183650
## + weekday_is_wednesday 1 9298260 9.4514e+11 183651
## + weekday_is_saturday 1 3780575 9.4514e+11 183651
## + weekday_is_sunday 1 3780575 9.4514e+11 183651
## + weekday_is_tuesday 1 2790474 9.4514e+11 183651
## + weekday_is_thursday 1 29884 9.4515e+11 183651
#We will run the model again with the chosen variables
summary(lm(data = time , shares ~ is_weekend + weekday_is_monday ))
##
## Call:
## lm(formula = shares ~ is_weekend + weekday_is_monday, data = time)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3703 -2238 -1728 -528 649362
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3028.1 116.3 26.030 <2e-16 ***
## is_weekend 729.0 291.6 2.500 0.0124 *
## weekday_is_monday 510.1 263.5 1.936 0.0529 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9723 on 9997 degrees of freedom
## Multiple R-squared: 0.0008554, Adjusted R-squared: 0.0006555
## F-statistic: 4.279 on 2 and 9997 DF, p-value: 0.01388
#This model shows a high significance for is_weekend, so we will drop weekday_is_monday
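As a check (a sketch, not run in the original analysis), the time-block model can be refit with is_weekend alone:

```r
# Sketch: refit keeping only is_weekend, since weekday_is_monday
# fell just above the 5% threshold in the model above.
summary(lm(data = time, shares ~ is_weekend))
```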
detach("package:MASS")
m5 = lm(data = keywords, shares ~ .)
summary(m5)
##
## Call:
## lm(formula = shares ~ ., data = keywords)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23591 -2002 -1224 -214 647794
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.590e+01 9.645e+02 -0.037 0.970308
## num_keywords 6.024e+01 6.061e+01 0.994 0.320310
## data_channel_is_lifestyle -9.323e+02 5.189e+02 -1.797 0.072384 .
## data_channel_is_entertainment -1.255e+03 3.923e+02 -3.200 0.001377 **
## data_channel_is_bus -4.875e+02 4.013e+02 -1.215 0.224505
## data_channel_is_socmed -3.775e+02 5.103e+02 -0.740 0.459495
## data_channel_is_tech -5.829e+02 4.050e+02 -1.439 0.150136
## data_channel_is_world -9.237e+02 4.215e+02 -2.191 0.028441 *
## kw_min_min 1.732e+00 2.669e+00 0.649 0.516375
## kw_max_min 2.529e-02 8.650e-02 0.292 0.769974
## kw_avg_min -3.689e-01 4.490e-01 -0.822 0.411305
## kw_min_max -3.647e-03 1.851e-03 -1.970 0.048847 *
## kw_max_max -3.023e-04 9.414e-04 -0.321 0.748140
## kw_avg_max 1.320e-03 1.304e-03 1.012 0.311367
## kw_min_avg -4.526e-01 1.230e-01 -3.679 0.000236 ***
## kw_max_avg -1.706e-01 3.817e-02 -4.468 7.98e-06 ***
## kw_avg_avg 1.574e+00 2.217e-01 7.096 1.37e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9632 on 9983 degrees of freedom
## Multiple R-squared: 0.02092, Adjusted R-squared: 0.01935
## F-statistic: 13.33 on 16 and 9983 DF, p-value: < 2.2e-16
type = dsmall %>%
select(data_channel_is_lifestyle, data_channel_is_entertainment, data_channel_is_bus, data_channel_is_socmed, data_channel_is_tech, data_channel_is_world, shares)
m5 = lm(data = type, shares ~ .)
summary(m5)
##
## Call:
## lm(formula = shares ~ ., data = type)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5457 -1988 -1397 -339 649673
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5478.6 247.2 22.165 < 2e-16 ***
## data_channel_is_lifestyle -2239.4 481.9 -4.647 3.41e-06 ***
## data_channel_is_entertainment -2881.7 336.8 -8.556 < 2e-16 ***
## data_channel_is_bus -2251.6 348.9 -6.453 1.15e-10 ***
## data_channel_is_socmed -2015.7 467.6 -4.311 1.64e-05 ***
## data_channel_is_tech -2556.8 333.8 -7.659 2.05e-14 ***
## data_channel_is_world -3217.5 323.9 -9.935 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9674 on 9993 degrees of freedom
## Multiple R-squared: 0.01128, Adjusted R-squared: 0.01069
## F-statistic: 19.01 on 6 and 9993 DF, p-value: < 2.2e-16
keys = dsmall %>%
select(kw_min_min, kw_max_min, kw_avg_min, kw_min_max, kw_max_max, kw_avg_max, kw_min_avg, kw_max_avg, kw_avg_avg, shares)
m6 = lm(data = keys, shares ~ .)
summary(m6)
##
## Call:
## lm(formula = shares ~ ., data = keys)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26141 -2050 -1272 -180 647700
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.987e+02 7.438e+02 -0.805 0.4209
## kw_min_min 2.079e+00 2.667e+00 0.779 0.4357
## kw_max_min 3.457e-02 8.555e-02 0.404 0.6862
## kw_avg_min -4.473e-01 4.407e-01 -1.015 0.3100
## kw_min_max -4.369e-03 1.822e-03 -2.397 0.0165 *
## kw_max_max -6.686e-04 9.088e-04 -0.736 0.4619
## kw_avg_max 1.845e-03 1.097e-03 1.681 0.0928 .
## kw_min_avg -5.389e-01 1.160e-01 -4.647 3.41e-06 ***
## kw_max_avg -2.017e-01 3.440e-02 -5.862 4.72e-09 ***
## kw_avg_avg 1.811e+00 1.832e-01 9.887 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9636 on 9990 degrees of freedom
## Multiple R-squared: 0.01942, Adjusted R-squared: 0.01853
## F-statistic: 21.98 on 9 and 9990 DF, p-value: < 2.2e-16
summary(lm(data = keys, shares ~ kw_max_min + kw_avg_min + kw_max_avg + kw_avg_avg))
##
## Call:
## lm(formula = shares ~ kw_max_min + kw_avg_min + kw_max_avg +
## kw_avg_avg, data = keys)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24921 -2111 -1355 -224 648275
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -133.81918 295.31594 -0.453 0.650
## kw_max_min -0.02590 0.08235 -0.314 0.753
## kw_avg_min -0.07703 0.41447 -0.186 0.853
## kw_max_avg -0.12252 0.02805 -4.368 1.27e-05 ***
## kw_avg_avg 1.30405 0.12445 10.478 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9651 on 9995 degrees of freedom
## Multiple R-squared: 0.01581, Adjusted R-squared: 0.01542
## F-statistic: 40.15 on 4 and 9995 DF, p-value: < 2.2e-16
#Rerun the model by removing the variables with p-value > 0.05
summary(lm(data = keys, shares ~ kw_max_avg + kw_avg_avg))
##
## Call:
## lm(formula = shares ~ kw_max_avg + kw_avg_avg, data = keys)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27910 -2109 -1357 -234 648185
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -131.49908 292.24325 -0.450 0.653
## kw_max_avg -0.13357 0.02667 -5.009 5.58e-07 ***
## kw_avg_avg 1.30595 0.12443 10.495 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9651 on 9997 degrees of freedom
## Multiple R-squared: 0.01565, Adjusted R-squared: 0.01545
## F-statistic: 79.46 on 2 and 9997 DF, p-value: < 2.2e-16
From the model above, we will keep kw_max_avg and kw_avg_avg.
m7 = lm(data = nlp, shares ~ .)
summary(m7)
##
## Call:
## lm(formula = shares ~ ., data = nlp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7912 -2269 -1279 -116 647368
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3528.01 709.16 4.975 6.64e-07 ***
## LDA_00 1078.88 453.45 2.379 0.017366 *
## LDA_01 -633.20 511.96 -1.237 0.216187
## LDA_02 -987.42 453.11 -2.179 0.029338 *
## LDA_03 1848.59 426.07 4.339 1.45e-05 ***
## LDA_04 NA NA NA NA
## global_subjectivity 6566.84 1378.34 4.764 1.92e-06 ***
## global_sentiment_polarity 1989.55 2766.66 0.719 0.472085
## global_rate_positive_words -9734.96 11824.99 -0.823 0.410384
## global_rate_negative_words 12875.47 22951.96 0.561 0.574827
## rate_positive_words -3794.68 1069.25 -3.549 0.000389 ***
## rate_negative_words -3677.28 1629.28 -2.257 0.024030 *
## avg_positive_polarity -2125.51 2270.23 -0.936 0.349165
## min_positive_polarity 1016.15 1817.12 0.559 0.576030
## max_positive_polarity -583.57 685.25 -0.852 0.394444
## avg_negative_polarity -2525.79 2077.39 -1.216 0.224071
## min_negative_polarity 145.44 708.91 0.205 0.837446
## max_negative_polarity 177.77 1725.69 0.103 0.917953
## title_subjectivity 87.08 356.31 0.244 0.806926
## title_sentiment_polarity 1029.88 388.14 2.653 0.007982 **
## abs_title_subjectivity 1396.15 609.85 2.289 0.022079 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9663 on 9980 degrees of freedom
## Multiple R-squared: 0.0149, Adjusted R-squared: 0.01303
## F-statistic: 7.946 on 19 and 9980 DF, p-value: < 2.2e-16
#Since this category also contains many variables, we will divide it further into sub-categories:
#Create a sub-category 1 for LDA:
lda = dsmall %>%
select(LDA_00, LDA_01, LDA_02, LDA_03, LDA_04, shares)
m8 = lm(data = lda, shares ~ .)
summary(m8)
##
## Call:
## lm(formula = shares ~ ., data = lda)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5015 -2190 -1349 -306 649250
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2814.5 275.2 10.226 < 2e-16 ***
## LDA_00 874.0 449.2 1.945 0.05175 .
## LDA_01 -484.9 501.5 -0.967 0.33358
## LDA_02 -1211.1 432.4 -2.801 0.00511 **
## LDA_03 2540.1 402.6 6.310 2.91e-10 ***
## LDA_04 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9683 on 9995 degrees of freedom
## Multiple R-squared: 0.009271, Adjusted R-squared: 0.008874
## F-statistic: 23.38 on 4 and 9995 DF, p-value: < 2.2e-16
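The `NA` row for LDA_04 is a symptom of perfect collinearity: the five LDA topic proportions sum to 1, so LDA_04 is an exact linear combination of the intercept and the other four, and `lm` drops the aliased column. A minimal toy sketch (hypothetical data, not the Mashable set):

```r
# Toy illustration: a predictor that is an exact linear combination of the
# others (here x3 = 1 - x1 - x2, like the LDA proportions summing to 1)
# gets an NA coefficient -- "1 not defined because of singularities".
set.seed(1)
x1 <- runif(100)
x2 <- runif(100)
x3 <- 1 - x1 - x2                 # perfectly collinear with intercept, x1, x2
y  <- 2 * x1 - x2 + rnorm(100)
fit <- lm(y ~ x1 + x2 + x3)
coef(fit)                         # coefficient for x3 is NA
```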
#Since LDA_04 was not estimable (NA), we drop LDA_00 and LDA_01 (the two least significant of the remaining terms) and rerun
lda = dsmall %>%
select(LDA_02, LDA_03, LDA_04, shares)
m9 = lm(data = lda, shares ~ .)
summary(m9)
##
## Call:
## lm(formula = shares ~ ., data = lda)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4970 -2195 -1360 -274 649454
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3132.9 232.1 13.496 < 2e-16 ***
## LDA_02 -1531.6 395.6 -3.871 0.000109 ***
## LDA_03 2165.7 393.1 5.509 3.69e-08 ***
## LDA_04 -313.2 391.8 -0.799 0.424055
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9686 on 9996 degrees of freedom
## Multiple R-squared: 0.008627, Adjusted R-squared: 0.008329
## F-statistic: 29 on 3 and 9996 DF, p-value: < 2.2e-16
We run a model with only LDA_02 and LDA_03 to check that both remain significant.
summary(lm(data = lda, shares~ LDA_02+ LDA_03))
##
## Call:
## lm(formula = shares ~ LDA_02 + LDA_03, data = lda)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4981 -2175 -1367 -306 649551
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2998.6 160.2 18.713 < 2e-16 ***
## LDA_02 -1406.2 363.2 -3.872 0.000109 ***
## LDA_03 2315.2 345.8 6.695 2.28e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9686 on 9997 degrees of freedom
## Multiple R-squared: 0.008564, Adjusted R-squared: 0.008365
## F-statistic: 43.17 on 2 and 9997 DF, p-value: < 2.2e-16
From the results, we will keep both LDA_02 and LDA_03.
#Sub-category 2 measuring polarity
pol = dsmall %>%
select(global_subjectivity, global_sentiment_polarity, global_rate_positive_words, global_rate_negative_words, rate_positive_words, rate_negative_words, avg_positive_polarity, min_positive_polarity, max_positive_polarity, avg_negative_polarity, min_negative_polarity, max_negative_polarity, title_subjectivity, title_sentiment_polarity, abs_title_subjectivity, shares)
m10 = lm(data = pol, shares ~ .)
summary(m10)
##
## Call:
## lm(formula = shares ~ ., data = pol)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7467 -2276 -1419 -219 647607
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4230.9 633.5 6.678 2.55e-11 ***
## global_subjectivity 7699.0 1360.6 5.659 1.57e-08 ***
## global_sentiment_polarity 2235.7 2768.6 0.808 0.41939
## global_rate_positive_words -9695.2 11784.2 -0.823 0.41068
## global_rate_negative_words 21857.9 22758.9 0.960 0.33687
## rate_positive_words -5224.0 1036.8 -5.039 4.77e-07 ***
## rate_negative_words -5850.4 1577.6 -3.708 0.00021 ***
## avg_positive_polarity -1300.1 2269.3 -0.573 0.56672
## min_positive_polarity 1376.7 1807.3 0.762 0.44620
## max_positive_polarity -543.7 686.1 -0.792 0.42810
## avg_negative_polarity -3689.5 2056.2 -1.794 0.07279 .
## min_negative_polarity 534.1 705.1 0.758 0.44875
## max_negative_polarity 244.2 1726.9 0.141 0.88756
## title_subjectivity 177.5 356.0 0.499 0.61810
## title_sentiment_polarity 1102.4 388.9 2.835 0.00460 **
## abs_title_subjectivity 1521.6 610.1 2.494 0.01265 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9686 on 9984 degrees of freedom
## Multiple R-squared: 0.009781, Adjusted R-squared: 0.008293
## F-statistic: 6.574 on 15 and 9984 DF, p-value: 2.864e-14
#Again, run the model with the variables that have p-value < 0.05.
pol2 = dsmall %>%
select(global_subjectivity, rate_positive_words, rate_negative_words, avg_negative_polarity, title_sentiment_polarity, abs_title_subjectivity, shares)
m11 = lm(data = pol2, shares ~ .)
summary(m11)
##
## Call:
## lm(formula = shares ~ ., data = pol2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6304 -2288 -1439 -221 647871
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4337.5 599.2 7.238 4.87e-13 ***
## global_subjectivity 7921.9 1172.2 6.758 1.48e-11 ***
## rate_positive_words -5993.6 804.3 -7.452 9.98e-14 ***
## rate_negative_words -5788.2 895.3 -6.465 1.06e-10 ***
## avg_negative_polarity -2182.2 903.5 -2.415 0.01574 *
## title_sentiment_polarity 1090.2 381.9 2.855 0.00431 **
## abs_title_subjectivity 1375.2 532.5 2.582 0.00983 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9684 on 9993 degrees of freedom
## Multiple R-squared: 0.009223, Adjusted R-squared: 0.008628
## F-statistic: 15.5 on 6 and 9993 DF, p-value: < 2.2e-16
For this second sub-category, we will keep global_subjectivity, rate_positive_words, rate_negative_words, avg_negative_polarity, title_sentiment_polarity, and abs_title_subjectivity.
We will use stepwise selection (forward, by AIC) to identify the most significant variables among those retained from the previous “block” analyses.
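As a reminder of what `stepAIC` compares at each step: forward selection starts from the intercept-only model and repeatedly adds the candidate that most reduces AIC, which for `lm` is computed as n·log(RSS/n) + 2k (the `extractAIC` convention, an additive constant away from `AIC()`). A hedged toy sketch of one such comparison (not the Mashable data):

```r
# Toy sketch of a single forward-selection step: compute the AIC that
# stepAIC() displays, n*log(RSS/n) + 2*k, for each candidate addition.
set.seed(2)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 3 * x1 + rnorm(n)       # only x1 is truly informative
aic_of <- function(fit) {
  rss <- sum(resid(fit)^2)
  k   <- length(coef(fit))        # number of estimated coefficients
  n * log(rss / n) + 2 * k
}
c(none   = aic_of(lm(y ~ 1)),
  add_x1 = aic_of(lm(y ~ x1)),    # large AIC drop: x1 would be added first
  add_x2 = aic_of(lm(y ~ x2)))    # negligible change: x2 is noise
```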
#Create a new dataset with the variables that we have selected
regData = dsmall %>%
select(n_unique_tokens, n_non_stop_unique_tokens, average_token_length, n_tokens_content, num_hrefs, num_self_hrefs, self_reference_min_shares, num_imgs, num_videos, is_weekend , kw_max_avg, kw_avg_avg, LDA_03, LDA_02, global_subjectivity, rate_positive_words, rate_negative_words, avg_negative_polarity, title_sentiment_polarity, abs_title_subjectivity, shares)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
null= lm(data=regData, shares ~ 1)
full = lm(data=regData, shares ~ .)
step = stepAIC(null, scope=list(lower=null, upper=full), direction = "forward")
## Start: AIC=183653.2
## shares ~ 1
##
## Df Sum of Sq RSS AIC
## + self_reference_min_shares 1 1.7624e+10 9.2833e+11 183467
## + kw_avg_avg 1 1.2466e+10 9.3349e+11 183523
## + LDA_03 1 6.6944e+09 9.3926e+11 183584
## + kw_max_avg 1 4.5429e+09 9.4141e+11 183607
## + LDA_02 1 3.8961e+09 9.4206e+11 183614
## + global_subjectivity 1 1.6742e+09 9.4428e+11 183637
## + average_token_length 1 1.3076e+09 9.4465e+11 183641
## + num_hrefs 1 1.2795e+09 9.4468e+11 183642
## + avg_negative_polarity 1 9.9203e+08 9.4496e+11 183645
## + num_imgs 1 9.6777e+08 9.4499e+11 183645
## + title_sentiment_polarity 1 6.1864e+08 9.4534e+11 183649
## + is_weekend 1 4.5474e+08 9.4550e+11 183650
## + n_tokens_content 1 3.4937e+08 9.4561e+11 183652
## + num_videos 1 3.4267e+08 9.4561e+11 183652
## + abs_title_subjectivity 1 2.8085e+08 9.4567e+11 183652
## + n_non_stop_unique_tokens 1 2.6692e+08 9.4569e+11 183652
## + rate_positive_words 1 2.4199e+08 9.4571e+11 183653
## <none> 9.4596e+11 183653
## + rate_negative_words 1 1.5979e+08 9.4580e+11 183654
## + num_self_hrefs 1 3.4131e+07 9.4592e+11 183655
## + n_unique_tokens 1 4.3950e+06 9.4595e+11 183655
##
## Step: AIC=183467.1
## shares ~ self_reference_min_shares
##
## Df Sum of Sq RSS AIC
## + kw_avg_avg 1 9261479023 9.1907e+11 183369
## + LDA_03 1 6072682385 9.2226e+11 183404
## + LDA_02 1 3230087597 9.2510e+11 183434
## + kw_max_avg 1 2607956330 9.2572e+11 183441
## + average_token_length 1 1636550645 9.2669e+11 183451
## + num_hrefs 1 1301601281 9.2703e+11 183455
## + global_subjectivity 1 1173555083 9.2716e+11 183456
## + num_imgs 1 912920403 9.2742e+11 183459
## + avg_negative_polarity 1 772751755 9.2756e+11 183461
## + title_sentiment_polarity 1 631614916 9.2770e+11 183462
## + n_non_stop_unique_tokens 1 480894368 9.2785e+11 183464
## + is_weekend 1 480679156 9.2785e+11 183464
## + num_videos 1 402943358 9.2793e+11 183465
## + abs_title_subjectivity 1 306088646 9.2803e+11 183466
## + rate_positive_words 1 304643514 9.2803e+11 183466
## + rate_negative_words 1 232618574 9.2810e+11 183467
## + n_tokens_content 1 195498759 9.2814e+11 183467
## <none> 9.2833e+11 183467
## + n_unique_tokens 1 21773542 9.2831e+11 183469
## + num_self_hrefs 1 4486089 9.2833e+11 183469
##
## Step: AIC=183368.9
## shares ~ self_reference_min_shares + kw_avg_avg
##
## Df Sum of Sq RSS AIC
## + kw_max_avg 1 2891029781 9.1618e+11 183339
## + LDA_03 1 1819256044 9.1725e+11 183351
## + LDA_02 1 1144589054 9.1793e+11 183358
## + average_token_length 1 855659040 9.1821e+11 183362
## + global_subjectivity 1 737201041 9.1833e+11 183363
## + num_hrefs 1 519432895 9.1855e+11 183365
## + title_sentiment_polarity 1 424538302 9.1865e+11 183366
## + avg_negative_polarity 1 414657026 9.1866e+11 183366
## + abs_title_subjectivity 1 381084214 9.1869e+11 183367
## + num_imgs 1 309000951 9.1876e+11 183368
## + is_weekend 1 298413233 9.1877e+11 183368
## <none> 9.1907e+11 183369
## + n_non_stop_unique_tokens 1 158908780 9.1891e+11 183369
## + rate_negative_words 1 130615678 9.1894e+11 183369
## + rate_positive_words 1 99576521 9.1897e+11 183370
## + num_videos 1 84893840 9.1898e+11 183370
## + n_tokens_content 1 50381270 9.1902e+11 183370
## + n_unique_tokens 1 4617967 9.1907e+11 183371
## + num_self_hrefs 1 3451655 9.1907e+11 183371
##
## Step: AIC=183339.4
## shares ~ self_reference_min_shares + kw_avg_avg + kw_max_avg
##
## Df Sum of Sq RSS AIC
## + LDA_03 1 754531099 9.1542e+11 183333
## + global_subjectivity 1 648783454 9.1553e+11 183334
## + average_token_length 1 532539330 9.1565e+11 183336
## + LDA_02 1 522478455 9.1566e+11 183336
## + num_hrefs 1 424432441 9.1575e+11 183337
## + title_sentiment_polarity 1 400275352 9.1578e+11 183337
## + abs_title_subjectivity 1 372119630 9.1581e+11 183337
## + avg_negative_polarity 1 349069360 9.1583e+11 183338
## + is_weekend 1 259694181 9.1592e+11 183339
## <none> 9.1618e+11 183339
## + num_imgs 1 165271159 9.1601e+11 183340
## + rate_negative_words 1 81706676 9.1610e+11 183340
## + n_non_stop_unique_tokens 1 63746334 9.1612e+11 183341
## + rate_positive_words 1 39072809 9.1614e+11 183341
## + num_videos 1 29199118 9.1615e+11 183341
## + n_tokens_content 1 13697536 9.1617e+11 183341
## + num_self_hrefs 1 1942967 9.1618e+11 183341
## + n_unique_tokens 1 349580 9.1618e+11 183341
##
## Step: AIC=183333.1
## shares ~ self_reference_min_shares + kw_avg_avg + kw_max_avg +
## LDA_03
##
## Df Sum of Sq RSS AIC
## + global_subjectivity 1 578956953 9.1485e+11 183329
## + average_token_length 1 401036239 9.1502e+11 183331
## + title_sentiment_polarity 1 394544061 9.1503e+11 183331
## + abs_title_subjectivity 1 371354003 9.1505e+11 183331
## + num_hrefs 1 345774501 9.1508e+11 183331
## + LDA_02 1 308393023 9.1512e+11 183332
## + avg_negative_polarity 1 266847149 9.1516e+11 183332
## + is_weekend 1 248037755 9.1518e+11 183332
## <none> 9.1542e+11 183333
## + rate_negative_words 1 88844302 9.1534e+11 183334
## + num_imgs 1 86783248 9.1534e+11 183334
## + n_non_stop_unique_tokens 1 36702838 9.1539e+11 183335
## + rate_positive_words 1 9893647 9.1541e+11 183335
## + n_unique_tokens 1 2126776 9.1542e+11 183335
## + num_videos 1 1343418 9.1542e+11 183335
## + n_tokens_content 1 230405 9.1542e+11 183335
## + num_self_hrefs 1 189094 9.1542e+11 183335
##
## Step: AIC=183328.8
## shares ~ self_reference_min_shares + kw_avg_avg + kw_max_avg +
## LDA_03 + global_subjectivity
##
## Df Sum of Sq RSS AIC
## + average_token_length 1 1961546251 9.1288e+11 183309
## + n_non_stop_unique_tokens 1 482050530 9.1436e+11 183326
## + abs_title_subjectivity 1 374372331 9.1447e+11 183327
## + title_sentiment_polarity 1 368699153 9.1448e+11 183327
## + rate_positive_words 1 305436932 9.1454e+11 183327
## + is_weekend 1 236455112 9.1461e+11 183328
## + n_unique_tokens 1 213924854 9.1463e+11 183328
## + num_hrefs 1 199983754 9.1465e+11 183329
## + LDA_02 1 191180220 9.1465e+11 183329
## <none> 9.1485e+11 183329
## + rate_negative_words 1 158765158 9.1469e+11 183329
## + num_imgs 1 56723615 9.1479e+11 183330
## + avg_negative_polarity 1 41222502 9.1480e+11 183330
## + n_tokens_content 1 15611257 9.1483e+11 183331
## + num_self_hrefs 1 10416129 9.1483e+11 183331
## + num_videos 1 6714064 9.1484e+11 183331
##
## Step: AIC=183309.3
## shares ~ self_reference_min_shares + kw_avg_avg + kw_max_avg +
## LDA_03 + global_subjectivity + average_token_length
##
## Df Sum of Sq RSS AIC
## + num_hrefs 1 525751997 9.1236e+11 183306
## + abs_title_subjectivity 1 448993607 9.1243e+11 183306
## + title_sentiment_polarity 1 273145684 9.1261e+11 183308
## + is_weekend 1 249367721 9.1263e+11 183309
## <none> 9.1288e+11 183309
## + avg_negative_polarity 1 140520049 9.1274e+11 183310
## + n_unique_tokens 1 101520064 9.1278e+11 183310
## + num_imgs 1 98252035 9.1279e+11 183310
## + LDA_02 1 51202913 9.1283e+11 183311
## + n_non_stop_unique_tokens 1 4832472 9.1288e+11 183311
## + num_videos 1 4433457 9.1288e+11 183311
## + rate_positive_words 1 1104713 9.1288e+11 183311
## + n_tokens_content 1 943086 9.1288e+11 183311
## + num_self_hrefs 1 588924 9.1288e+11 183311
## + rate_negative_words 1 529928 9.1288e+11 183311
##
## Step: AIC=183305.6
## shares ~ self_reference_min_shares + kw_avg_avg + kw_max_avg +
## LDA_03 + global_subjectivity + average_token_length + num_hrefs
##
## Df Sum of Sq RSS AIC
## + n_unique_tokens 1 464109349 9.1189e+11 183302
## + abs_title_subjectivity 1 443667517 9.1191e+11 183303
## + title_sentiment_polarity 1 242749403 9.1212e+11 183305
## + is_weekend 1 195360065 9.1216e+11 183305
## <none> 9.1236e+11 183306
## + n_non_stop_unique_tokens 1 168143118 9.1219e+11 183306
## + n_tokens_content 1 149328442 9.1221e+11 183306
## + num_self_hrefs 1 129594404 9.1223e+11 183306
## + avg_negative_polarity 1 123432778 9.1223e+11 183306
## + LDA_02 1 67786118 9.1229e+11 183307
## + num_videos 1 16085170 9.1234e+11 183307
## + num_imgs 1 7728638 9.1235e+11 183307
## + rate_negative_words 1 1751509 9.1236e+11 183308
## + rate_positive_words 1 143432 9.1236e+11 183308
##
## Step: AIC=183302.5
## shares ~ self_reference_min_shares + kw_avg_avg + kw_max_avg +
## LDA_03 + global_subjectivity + average_token_length + num_hrefs +
## n_unique_tokens
##
## Df Sum of Sq RSS AIC
## + abs_title_subjectivity 1 455384929 9.1144e+11 183299
## + title_sentiment_polarity 1 263450931 9.1163e+11 183302
## + is_weekend 1 198389536 9.1170e+11 183302
## <none> 9.1189e+11 183302
## + n_non_stop_unique_tokens 1 161904775 9.1173e+11 183303
## + avg_negative_polarity 1 142182718 9.1175e+11 183303
## + num_self_hrefs 1 115956730 9.1178e+11 183303
## + num_imgs 1 103641992 9.1179e+11 183303
## + LDA_02 1 35289104 9.1186e+11 183304
## + num_videos 1 21697807 9.1187e+11 183304
## + rate_positive_words 1 5565899 9.1189e+11 183304
## + rate_negative_words 1 4290101 9.1189e+11 183304
## + n_tokens_content 1 2395209 9.1189e+11 183304
##
## Step: AIC=183299.5
## shares ~ self_reference_min_shares + kw_avg_avg + kw_max_avg +
## LDA_03 + global_subjectivity + average_token_length + num_hrefs +
## n_unique_tokens + abs_title_subjectivity
##
## Df Sum of Sq RSS AIC
## + title_sentiment_polarity 1 479539027 9.1096e+11 183296
## + is_weekend 1 212840430 9.1123e+11 183299
## <none> 9.1144e+11 183299
## + n_non_stop_unique_tokens 1 181204661 9.1126e+11 183300
## + avg_negative_polarity 1 139125824 9.1130e+11 183300
## + num_self_hrefs 1 116034695 9.1132e+11 183300
## + num_imgs 1 111880959 9.1133e+11 183300
## + LDA_02 1 50652389 9.1139e+11 183301
## + num_videos 1 17154284 9.1142e+11 183301
## + n_tokens_content 1 3471924 9.1144e+11 183301
## + rate_positive_words 1 2174158 9.1144e+11 183301
## + rate_negative_words 1 1694519 9.1144e+11 183301
##
## Step: AIC=183296.2
## shares ~ self_reference_min_shares + kw_avg_avg + kw_max_avg +
## LDA_03 + global_subjectivity + average_token_length + num_hrefs +
## n_unique_tokens + abs_title_subjectivity + title_sentiment_polarity
##
## Df Sum of Sq RSS AIC
## + avg_negative_polarity 1 224796621 9.1073e+11 183296
## + is_weekend 1 210770015 9.1075e+11 183296
## <none> 9.1096e+11 183296
## + n_non_stop_unique_tokens 1 175387398 9.1078e+11 183296
## + num_self_hrefs 1 122646171 9.1084e+11 183297
## + num_imgs 1 105435505 9.1085e+11 183297
## + LDA_02 1 35399106 9.1092e+11 183298
## + rate_positive_words 1 33719598 9.1093e+11 183298
## + rate_negative_words 1 30981234 9.1093e+11 183298
## + num_videos 1 17249752 9.1094e+11 183298
## + n_tokens_content 1 3269191 9.1096e+11 183298
##
## Step: AIC=183295.8
## shares ~ self_reference_min_shares + kw_avg_avg + kw_max_avg +
## LDA_03 + global_subjectivity + average_token_length + num_hrefs +
## n_unique_tokens + abs_title_subjectivity + title_sentiment_polarity +
## avg_negative_polarity
##
## Df Sum of Sq RSS AIC
## + n_non_stop_unique_tokens 1 210141605 9.1052e+11 183295
## + is_weekend 1 207824509 9.1053e+11 183295
## <none> 9.1073e+11 183296
## + num_self_hrefs 1 116082594 9.1062e+11 183296
## + num_imgs 1 105366838 9.1063e+11 183297
## + LDA_02 1 42919130 9.1069e+11 183297
## + num_videos 1 27455246 9.1071e+11 183297
## + rate_positive_words 1 4139532 9.1073e+11 183298
## + rate_negative_words 1 1742642 9.1073e+11 183298
## + n_tokens_content 1 752353 9.1073e+11 183298
##
## Step: AIC=183295.5
## shares ~ self_reference_min_shares + kw_avg_avg + kw_max_avg +
## LDA_03 + global_subjectivity + average_token_length + num_hrefs +
## n_unique_tokens + abs_title_subjectivity + title_sentiment_polarity +
## avg_negative_polarity + n_non_stop_unique_tokens
##
## Df Sum of Sq RSS AIC
## + is_weekend 1 200004774 9.1032e+11 183295
## <none> 9.1052e+11 183295
## + num_self_hrefs 1 102586799 9.1042e+11 183296
## + n_tokens_content 1 54690929 9.1047e+11 183297
## + num_imgs 1 45770362 9.1048e+11 183297
## + LDA_02 1 42049263 9.1048e+11 183297
## + num_videos 1 16449091 9.1051e+11 183297
## + rate_negative_words 1 6524555 9.1052e+11 183297
## + rate_positive_words 1 904379 9.1052e+11 183297
##
## Step: AIC=183295.3
## shares ~ self_reference_min_shares + kw_avg_avg + kw_max_avg +
## LDA_03 + global_subjectivity + average_token_length + num_hrefs +
## n_unique_tokens + abs_title_subjectivity + title_sentiment_polarity +
## avg_negative_polarity + n_non_stop_unique_tokens + is_weekend
##
## Df Sum of Sq RSS AIC
## <none> 9.1032e+11 183295
## + num_self_hrefs 1 107928560 9.1022e+11 183296
## + n_tokens_content 1 46036605 9.1028e+11 183297
## + LDA_02 1 45348893 9.1028e+11 183297
## + num_imgs 1 41377684 9.1028e+11 183297
## + num_videos 1 12287905 9.1031e+11 183297
## + rate_negative_words 1 5350473 9.1032e+11 183297
## + rate_positive_words 1 447435 9.1032e+11 183297
summary(step)
##
## Call:
## lm(formula = shares ~ self_reference_min_shares + kw_avg_avg +
## kw_max_avg + LDA_03 + global_subjectivity + average_token_length +
## num_hrefs + n_unique_tokens + abs_title_subjectivity + title_sentiment_polarity +
## avg_negative_polarity + n_non_stop_unique_tokens + is_weekend,
## data = regData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42296 -2126 -1122 -1 628950
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.275e+03 6.968e+02 1.829 0.067395 .
## self_reference_min_shares 6.145e-02 4.788e-03 12.835 < 2e-16 ***
## kw_avg_avg 8.891e-01 1.420e-01 6.260 4.01e-10 ***
## kw_max_avg -1.003e-01 2.803e-02 -3.578 0.000348 ***
## LDA_03 2.156e+02 4.016e+02 0.537 0.591457
## global_subjectivity 4.224e+03 1.123e+03 3.762 0.000169 ***
## average_token_length -1.002e+03 1.944e+02 -5.156 2.57e-07 ***
## num_hrefs 2.767e+01 1.012e+01 2.734 0.006274 **
## n_unique_tokens 5.379e+03 2.186e+03 2.460 0.013895 *
## abs_title_subjectivity 1.491e+03 5.258e+02 2.835 0.004585 **
## title_sentiment_polarity 9.269e+02 3.729e+02 2.486 0.012942 *
## avg_negative_polarity -1.445e+03 8.630e+02 -1.675 0.093970 .
## n_non_stop_unique_tokens -3.061e+03 2.055e+03 -1.490 0.136316
## is_weekend 4.198e+02 2.834e+02 1.481 0.138581
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9548 on 9986 degrees of freedom
## Multiple R-squared: 0.03767, Adjusted R-squared: 0.03641
## F-statistic: 30.07 on 13 and 9986 DF, p-value: < 2.2e-16
detach("package:MASS")
Since the R-squared and adjusted R-squared are very low, we visualize the standardized residuals:
qplot(predict(step), rstandard(step), geom="point", xlim = c(0,10000)) + geom_hline(yintercept=0, colour=I("blue"), alpha=I(0.5))
## Warning: Removed 50 rows containing missing values (geom_point).
#We will try to remove the variables with the highest p-value and rerun
mod = lm(formula = shares ~ self_reference_min_shares + kw_avg_avg +
kw_max_avg + global_subjectivity + average_token_length +
num_hrefs + n_unique_tokens + abs_title_subjectivity + title_sentiment_polarity,
data = regData)
summary(mod)
##
## Call:
## lm(formula = shares ~ self_reference_min_shares + kw_avg_avg +
## kw_max_avg + global_subjectivity + average_token_length +
## num_hrefs + n_unique_tokens + abs_title_subjectivity + title_sentiment_polarity,
## data = regData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42304 -2146 -1148 4 629298
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.058e+03 6.757e+02 1.566 0.117345
## self_reference_min_shares 6.142e-02 4.788e-03 12.827 < 2e-16 ***
## kw_avg_avg 9.837e-01 1.298e-01 7.577 3.84e-14 ***
## kw_max_avg -1.128e-01 2.696e-02 -4.184 2.88e-05 ***
## global_subjectivity 4.654e+03 1.048e+03 4.440 9.10e-06 ***
## average_token_length -1.080e+03 1.833e+02 -5.889 4.02e-09 ***
## num_hrefs 3.258e+01 9.897e+00 3.292 0.000998 ***
## n_unique_tokens 2.638e+03 1.013e+03 2.604 0.009223 **
## abs_title_subjectivity 1.431e+03 5.253e+02 2.724 0.006469 **
## title_sentiment_polarity 8.451e+02 3.693e+02 2.288 0.022134 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9550 on 9990 degrees of freedom
## Multiple R-squared: 0.03687, Adjusted R-squared: 0.036
## F-statistic: 42.49 on 9 and 9990 DF, p-value: < 2.2e-16
#Again, remove the variables with a high p-value
summary(lm(formula = shares ~ self_reference_min_shares + kw_avg_avg +
kw_max_avg + global_subjectivity + average_token_length +
num_hrefs + n_unique_tokens + abs_title_subjectivity, data = regData))
##
## Call:
## lm(formula = shares ~ self_reference_min_shares + kw_avg_avg +
## kw_max_avg + global_subjectivity + average_token_length +
## num_hrefs + n_unique_tokens + abs_title_subjectivity, data = regData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42055 -2153 -1158 0 629293
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.230e+03 6.716e+02 1.831 0.067183 .
## self_reference_min_shares 6.135e-02 4.789e-03 12.811 < 2e-16 ***
## kw_avg_avg 9.885e-01 1.298e-01 7.614 2.91e-14 ***
## kw_max_avg -1.129e-01 2.696e-02 -4.189 2.83e-05 ***
## global_subjectivity 4.791e+03 1.047e+03 4.577 4.78e-06 ***
## average_token_length -1.091e+03 1.833e+02 -5.954 2.71e-09 ***
## num_hrefs 3.317e+01 9.896e+00 3.352 0.000804 ***
## n_unique_tokens 2.557e+03 1.013e+03 2.525 0.011592 *
## abs_title_subjectivity 1.146e+03 5.105e+02 2.246 0.024753 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9552 on 9991 degrees of freedom
## Multiple R-squared: 0.03637, Adjusted R-squared: 0.03559
## F-statistic: 47.13 on 8 and 9991 DF, p-value: < 2.2e-16
Our final model is the following:
model = lm(formula = shares ~ self_reference_min_shares + kw_avg_avg +
kw_max_avg + global_subjectivity + average_token_length +
num_hrefs + n_unique_tokens, data = regData)
summary(model)
##
## Call:
## lm(formula = shares ~ self_reference_min_shares + kw_avg_avg +
## kw_max_avg + global_subjectivity + average_token_length +
## num_hrefs + n_unique_tokens, data = regData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42338 -2137 -1170 -25 629505
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.593e+03 6.519e+02 2.444 0.014538 *
## self_reference_min_shares 6.132e-02 4.790e-03 12.801 < 2e-16 ***
## kw_avg_avg 9.894e-01 1.299e-01 7.619 2.79e-14 ***
## kw_max_avg -1.137e-01 2.697e-02 -4.214 2.53e-05 ***
## global_subjectivity 4.732e+03 1.047e+03 4.521 6.22e-06 ***
## average_token_length -1.076e+03 1.832e+02 -5.872 4.45e-09 ***
## num_hrefs 3.320e+01 9.898e+00 3.355 0.000797 ***
## n_unique_tokens 2.531e+03 1.013e+03 2.499 0.012454 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9554 on 9992 degrees of freedom
## Multiple R-squared: 0.03588, Adjusted R-squared: 0.0352
## F-statistic: 53.12 on 7 and 9992 DF, p-value: < 2.2e-16
Since the R-squared is low, we calculate the accuracy of the model to see how well it predicts the number of shares.
To measure accuracy, and following the article “Predicting and Evaluating the Popularity of Online News”, we use the Accuracy indicator reported in Table IV of that article.
We split the variable “shares” into two equal-sized groups around the median and split the predictions the same way. The indicator then measures the proportion of correct predictions: the number of true positives plus true negatives, divided by the size of the whole sample.
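The indicator can be sketched on toy numbers: label both actual and predicted shares as above (+1) or below (-1) their respective medians, then take the proportion of agreements (hypothetical values, not Mashable predictions):

```r
# Toy sketch of the median-split Accuracy indicator on hypothetical numbers.
actual    <- c(100, 900, 1500, 2000, 5000, 300)
predicted <- c(200, 1600, 1400, 2500, 4800, 250)
actual_cls    <- ifelse(actual    >= median(actual),    1, -1)
predicted_cls <- ifelse(predicted >= median(predicted), 1, -1)
accuracy <- mean(actual_cls == predicted_cls)   # proportion of agreements
accuracy
```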
#Here, we switch to the whole dataset instead of the sample dataset.
summary(data$shares)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 946 1400 3395 2800 843300
#Add a column aboveMedian that classifies each article by whether its shares are at or above the overall median (1400)
data = data %>%
mutate(aboveMedian = ifelse(shares>= 1400, 1, -1))
fitted.results = predict(model,data,type='response')
#expected median
med = median(fitted.results)
fitted.results = ifelse(fitted.results >= med,1,-1)
misClassificationError = mean(fitted.results != data$aboveMedian)
print(paste('Accuracy', 1 - misClassificationError))
## [1] "Accuracy 0.591665825850066"
qplot(predict(model), rstandard(model), geom="point", xlim= c(0,10000), ylim = c(-1,20)) + geom_hline(yintercept=0, colour=I("blue"), alpha=I(0.5))
## Warning: Removed 48 rows containing missing values (geom_point).
The residuals are not distributed symmetrically around the zero line, suggesting there is no clear linear relation between our predictors and the number of shares. The errors are visibly spread out far more above the line than below it.
This is related to the fact that the number of shares is very unevenly distributed: there are very few cases in which the number of shares is extremely high.
We check normality by plotting a QQ-plot of the residuals as well as a histogram of the residuals.
# Check normality using histogram
q1 = qplot(rstandard(model), geom="blank", xlim = c(-1,10)) +
geom_histogram(aes(y=..density..), colour=I("gray"), binwidth=0.2)+
stat_function(fun=dnorm, args=list(mean=0, sd=1),
colour=I("red"), alpha=I(0.5))
# Check normality using qqplot
q2 = qplot(sample=rstandard(model)) +
geom_abline(slope=1,intercept=0)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
grid.arrange(q1, q2, nrow=1)
## Warning: Removed 28 rows containing non-finite values (stat_bin).
The residuals do not appear normally distributed: their distribution is asymmetrical and, even though its expected value is 0, clearly skewed.
qplot(predict(model), rstandard(model), geom="point", ylim = c(-10,20)) + geom_hline(yintercept=0) +
geom_hline(yintercept=2, colour = I("red"), alpha=I(0.5)) +
geom_hline(yintercept=-2, colour = I("red"), alpha=I(0.5))
## Warning: Removed 2 rows containing missing values (geom_point).
The homoscedasticity assumption is violated: the chart shows a substantial number of points outside the ±2 standardized-residual band.
Therefore, we will look for an optimal transformation of the response variable (shares) to obtain a homoscedastic model.
library(car)
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
spreadLevelPlot(model)
## Warning in spreadLevelPlot.lm(model): 7 negative fitted values removed
##
## Suggested power transformation: -0.3020725
# the suggested transformation is -0.3020725
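spreadLevelPlot suggests a Tukey-style power transformation, shares^λ with λ ≈ -0.302. On a few illustrative values taken from summary(data$shares), the transform drastically compresses the right tail; note that a negative power also reverses order (more shares maps to a smaller transformed value), which is why coefficient signs flip in the transformed model below.

```r
# Illustrative sketch of the suggested power transformation (values taken
# from summary(data$shares); not a new model). A negative exponent both
# compresses the heavy right tail and reverses order.
lambda <- -0.3020725
shares <- c(946, 1400, 2800, 843300)     # 1st Qu., median, 3rd Qu., max
transformed <- shares^lambda
round(transformed, 4)
# Raw spread max/min is ~890x; after transformation it is under 10x,
# and the ordering is reversed (the most-shared article is now smallest).
c(raw = max(shares) / min(shares),
  transformed = max(transformed) / min(transformed))
```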
We will try to use the transformed variable and see if the model improves
regData = regData %>%
mutate(sharesPower = shares^-0.3020725)
model2 = lm(data = regData, formula = sharesPower ~ self_reference_min_shares + kw_avg_avg +
kw_max_avg + global_subjectivity + average_token_length +
num_hrefs + n_unique_tokens)
summary(model2)
##
## Call:
## lm(formula = sharesPower ~ self_reference_min_shares + kw_avg_avg +
## kw_max_avg + global_subjectivity + average_token_length +
## num_hrefs + n_unique_tokens, data = regData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.08901 -0.01641 0.00217 0.01615 0.50195
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.234e-01 1.861e-03 66.312 < 2e-16 ***
## self_reference_min_shares -8.805e-08 1.367e-08 -6.441 1.24e-10 ***
## kw_avg_avg -6.731e-06 3.706e-07 -18.162 < 2e-16 ***
## kw_max_avg 9.039e-07 7.696e-08 11.744 < 2e-16 ***
## global_subjectivity -1.825e-02 2.987e-03 -6.110 1.03e-09 ***
## average_token_length 1.409e-03 5.228e-04 2.695 0.007045 **
## num_hrefs -1.884e-04 2.825e-05 -6.670 2.69e-11 ***
## n_unique_tokens 1.026e-02 2.890e-03 3.551 0.000385 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02727 on 9992 degrees of freedom
## Multiple R-squared: 0.06759, Adjusted R-squared: 0.06694
## F-statistic: 103.5 on 7 and 9992 DF, p-value: < 2.2e-16
We have a slightly higher R-squared. We will continue using this model in the next stages.
qplot(predict(model2), rstandard(model2), geom="point", ylim = c(-10,20)) + geom_hline(yintercept=0) +
geom_hline(yintercept=2, colour = I("red"), alpha=I(0.5)) +
geom_hline(yintercept=-2, colour = I("red"), alpha=I(0.5))
We can see that, with the new model, the homoscedasticity assumption is respected.
The residuals might have been auto-correlated in Mashable’s early years (the site was founded in 2005), since a very successful article could have raised the popularity of the website and thereby influenced the popularity of subsequent articles.
Hence, we want to check that the website is now popular enough that there is no remaining time-series effect in which the shares of past articles drive the shares of future ones.
We continue with model2 to check the remaining assumptions, and run the Durbin-Watson test for independence of the residuals.
library(car)
durbinWatsonTest(model2)
## lag Autocorrelation D-W Statistic p-value
## 1 0.008251294 1.983457 0.402
## Alternative hypothesis: rho != 0
The p-value is high, so we fail to reject the null hypothesis of no autocorrelation: the residuals show no evidence of first-order auto-correlation.
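For reference, the Durbin-Watson statistic is the ratio of the sum of squared successive residual differences to the sum of squared residuals, and it sits near 2 when residuals are uncorrelated (roughly 2 * (1 - r1), with r1 the lag-1 autocorrelation). A minimal sketch on simulated white noise standing in for the model residuals:

```r
# Durbin-Watson statistic by hand: DW = sum(diff(e)^2) / sum(e^2).
# For uncorrelated residuals DW is close to 2, matching the ~1.98
# reported by car::durbinWatsonTest above.
set.seed(17)
e <- rnorm(1000)                    # stand-in for model residuals
dw <- sum(diff(e)^2) / sum(e^2)
dw
```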
We finally validate the model by comparing its performance on a training set and a testing set.
We first split the dataset:
library(caTools)
set.seed(17)
data = data %>%
mutate(sharesPower = shares^-0.3020725)
split = sample.split(data$sharesPower, SplitRatio = 0.8)
training = subset(data, split==TRUE)
testing = subset(data, split==FALSE)
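`caTools::sample.split` aims to preserve the distribution of its first argument across the two subsets. As a simpler, hypothetical alternative (without that stratification), an 80/20 split can be drawn with base R alone; `n` and the ratio here are illustrative:

```r
# Plain base-R 80/20 train/test split (no stratification), as a simpler
# alternative to caTools::sample.split.
set.seed(17)
n <- 1000
train_idx <- sample(n, size = floor(0.8 * n))   # 800 random row indices
in_training <- seq_len(n) %in% train_idx        # logical split vector
c(training = sum(in_training), testing = sum(!in_training))
```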
We then re-estimate the model on the new training set.
fit <- lm(data= training, sharesPower ~ self_reference_min_shares + kw_avg_avg +
kw_max_avg + global_subjectivity + average_token_length +
num_hrefs + n_unique_tokens)
summary(fit)
##
## Call:
## lm(formula = sharesPower ~ self_reference_min_shares + kw_avg_avg +
## kw_max_avg + global_subjectivity + average_token_length +
## num_hrefs + n_unique_tokens, data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.09283 -0.01671 0.00202 0.01660 0.90756
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.263e-01 1.069e-03 118.143 < 2e-16 ***
## self_reference_min_shares -5.894e-08 7.601e-09 -7.755 9.12e-15 ***
## kw_avg_avg -7.348e-06 2.134e-07 -34.432 < 2e-16 ***
## kw_max_avg 8.855e-07 4.386e-08 20.188 < 2e-16 ***
## global_subjectivity -2.003e-02 1.721e-03 -11.638 < 2e-16 ***
## average_token_length 2.512e-03 2.441e-04 10.289 < 2e-16 ***
## num_hrefs -1.903e-04 1.430e-05 -13.305 < 2e-16 ***
## n_unique_tokens -5.937e-05 3.984e-05 -1.490 0.136
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02791 on 31809 degrees of freedom
## Multiple R-squared: 0.0712, Adjusted R-squared: 0.07099
## F-statistic: 348.3 on 7 and 31809 DF, p-value: < 2.2e-16
# We remove n_unique_tokens from the model because its p-value (0.136) is above the usual significance threshold.
fit <- lm(data= training, sharesPower ~ self_reference_min_shares + kw_avg_avg +
kw_max_avg + global_subjectivity + average_token_length +
num_hrefs)
summary(fit)
##
## Call:
## lm(formula = sharesPower ~ self_reference_min_shares + kw_avg_avg +
## kw_max_avg + global_subjectivity + average_token_length +
## num_hrefs, data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.09282 -0.01671 0.00203 0.01659 0.90756
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.263e-01 1.069e-03 118.144 < 2e-16 ***
## self_reference_min_shares -5.894e-08 7.601e-09 -7.755 9.1e-15 ***
## kw_avg_avg -7.351e-06 2.134e-07 -34.448 < 2e-16 ***
## kw_max_avg 8.859e-07 4.386e-08 20.199 < 2e-16 ***
## global_subjectivity -1.997e-02 1.720e-03 -11.606 < 2e-16 ***
## average_token_length 2.499e-03 2.440e-04 10.243 < 2e-16 ***
## num_hrefs -1.901e-04 1.430e-05 -13.292 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02791 on 31810 degrees of freedom
## Multiple R-squared: 0.07113, Adjusted R-squared: 0.07096
## F-statistic: 406 on 6 and 31810 DF, p-value: < 2.2e-16
Rsq.Training = summary(fit)$r.squared
And we finally evaluate the R-squared on the testing dataset:
SSE = (predict(fit, newdata = testing) - testing$sharesPower)^2 %>% sum()
SSTotal = (testing$sharesPower - mean(testing$sharesPower))^2 %>% sum()
Rsq.Testing = (SSTotal - SSE)/SSTotal
cat("R^2 Training = ", Rsq.Training, " vs R^2 Testing ", Rsq.Testing, ".", sep="")
## R^2 Training = 0.07113062 vs R^2 Testing 0.04879411.
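The out-of-sample R-squared above is (SSTotal - SSE)/SSTotal, i.e. 1 - SSE/SSTotal. A small helper (the name `rsq_oos` is our own) makes this reusable; the synthetic actual/predicted vectors below are illustrative:

```r
# Out-of-sample R^2: 1 - SSE/SSTotal, equivalent to the form used above.
rsq_oos <- function(actual, predicted) {
  sse <- sum((actual - predicted)^2)            # residual sum of squares
  sst <- sum((actual - mean(actual))^2)         # total sum of squares
  1 - sse / sst
}

set.seed(1)
a <- rnorm(100)
p <- a + rnorm(100, sd = 0.5)   # imperfect predictions
rsq_oos(a, p)
```

Note that on held-out data this quantity can be negative when the model predicts worse than the test-set mean.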
Even though the final model contains only variables with very low p-values (below 1e-14), the outcome is a model with an R-squared barely above zero (0.071 in training, 0.049 in testing).
The response variable has a very particular distribution, which generated several challenges during the modelling process.
We have some ideas about additional actions that could improve the results.
The shares field could be recoded into three categories: Low_Number_Shares (0), High_Number_Shares (1) and an undetermined middle zone.
The undetermined zone could be excluded from the dataset (as a grey area, it might distort the modelling process), whereas the Low_Number_Shares (0) vs High_Number_Shares (1) distinction might separate more clearly the characteristics that influence the success of an article (measured in number of shares). The choice of split points should follow a logical process and take the distribution of the data into account.
Since we would then work with a binary response variable, a logistic regression model would be appropriate.
It might also be necessary to generate additional variables or transformations to increase the predictive power of the features. We did not perform these tasks in this exercise because of the large number of independent variables to analyze (almost 60).
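The proposed follow-up can be sketched as follows. All thresholds (40th/60th percentiles for the grey zone), the synthetic data and the single predictor are illustrative assumptions, not results from the analysis above:

```r
# Sketch: recode shares as Low (0) / High (1), drop a middle "grey zone",
# and fit a logistic regression with glm(family = binomial).
set.seed(17)
df <- data.frame(shares    = rlnorm(2000, meanlog = 7, sdlog = 1),
                 num_hrefs = rpois(2000, 10))   # synthetic stand-ins

lo <- quantile(df$shares, 0.40)                 # lower grey-zone bound
hi <- quantile(df$shares, 0.60)                 # upper grey-zone bound
df <- subset(df, shares <= lo | shares >= hi)   # exclude the grey zone
df$popular <- as.integer(df$shares >= hi)       # binary response

logit_fit <- glm(popular ~ num_hrefs, data = df, family = binomial)
coef(summary(logit_fit))
```

With a real split, the choice of the two cut points should be driven by the observed distribution of shares, as noted above.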